Segmentation and alignment of parallel text for statistical machine translation

Authors

  • Yonggang Deng
  • Shankar Kumar
  • William J. Byrne

Abstract

We address the problem of extracting bilingual chunk pairs from parallel text to create training sets for statistical machine translation. We formulate the problem in terms of a stochastic generative process over text translation pairs, and derive two different alignment procedures based on the underlying alignment model. The first is a now-standard dynamic programming alignment procedure, which we use to generate an initial coarse alignment of the parallel text. The second is a divisive clustering procedure for parallel text alignment, which we use to refine the first-pass alignments. This latter procedure is novel in that it permits the segmentation of the parallel text into sub-sentence units, which may be reordered to improve the chunk alignment. The quality of the chunk pairs is measured by the performance of machine translation systems trained on them. We show the practical benefits of divisive clustering, as well as how system performance can be improved by exploiting portions of the parallel text that would otherwise have to be discarded. We also show that chunk alignment as a first step in word alignment can significantly reduce word alignment error rate.
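To make the first-pass idea concrete, the following is a minimal sketch of monotone dynamic programming alignment over sentence sequences. It is not the authors' model: the set of allowed chunk shapes (`BEADS`), the fixed penalties, and the character-length cost in `bead_cost` are all illustrative assumptions standing in for the probabilities that the paper derives from its generative process.

```python
from typing import List, Tuple

# Hypothetical chunk shapes (source sentences, target sentences) and a
# hypothetical fixed penalty for each; 1:1 is the cheapest shape.
BEADS = {(1, 1): 0.0, (2, 1): 1.0, (1, 2): 1.0, (1, 0): 4.0, (0, 1): 4.0}


def bead_cost(src_len: int, tgt_len: int, penalty: float) -> float:
    """Toy cost: normalized character-length mismatch plus the shape penalty."""
    total = src_len + tgt_len
    mismatch = abs(src_len - tgt_len) / total if total else 0.0
    return mismatch + penalty


def align(src: List[str], tgt: List[str]) -> List[Tuple[List[str], List[str]]]:
    """Return the minimum-cost monotone segmentation into chunk pairs."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            # Extend the best partial alignment ending at (i, j) by one chunk.
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                cost = best[i][j] + bead_cost(s_len, t_len, penalty)
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (i, j)
    # Follow back-pointers from the end of both texts to recover the chunks.
    chunks = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        chunks.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    chunks.reverse()
    return chunks
```

For example, `align(["aaaa", "bb", "cc"], ["AAAA", "BBCC"])` pairs the first sentences one-to-one and merges the two short source sentences against the single target sentence. Because every chunk covers consecutive sentences on both sides, a procedure of this kind cannot reorder segments, which is the limitation the abstract's second, divisive clustering procedure is meant to address.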

Similar resources

Deeper than Words: Morph-based Alignment for Statistical Machine Translation

In this paper we introduce a novel approach to alignment for statistical machine translation. The core idea is to align subword units, or morphs, instead of word forms. This results in a joint segmentation and alignment model, aimed to improve translation quality for morphologically rich languages and reduce the size of the required parallel corpora. Here we focus on translating from inflection...

Enhancing Statistical Machine Translation with Character Alignment

The dominant practice in statistical machine translation (SMT) uses the same Chinese word segmentation specification in both the alignment and the translation rule induction steps when building a Chinese-English SMT system. This can be suboptimal, because the word segmentation that is better for alignment is not necessarily better for translation. To tackle this, we propose a framework that uses two di...

Dependency Treelet Translation: Syntactically Informed Phrasal SMT

We describe a novel approach to statistical machine translation that combines syntactic information in the source language with recent advances in phrasal translation. This method requires a source-language dependency parser, target language word segmentation and an unsupervised word alignment component. We align a parallel corpus, project the source dependency parse onto the target sentence, e...

Sequence segmentation for statistical machine translation

In the last decade, while statistical machine translation has advanced significantly, there is still much room for further improvements relating to many natural language processing tasks such as word segmentation, word alignment and parsing. Human language is composed of sequences of meaningful units. These sequences can be words, phrases, sentences or even articles serving as basic elements in...

MTTK: An Alignment Toolkit for Statistical Machine Translation

The MTTK alignment toolkit for statistical machine translation can be used for word, phrase, and sentence alignment of parallel documents. It is designed mainly for building statistical machine translation systems, but can be exploited in other multi-lingual applications. It provides computationally efficient alignment and estimation procedures that can be used for the unsupervised alignment of...


Journal title:
  • Natural Language Engineering

Volume 13

Publication date: 2007